#3. Combine files into one, sort them numerically, and collapse
#   redundant entries
sort -n temp.lines1 | uniq > temp.lines
rm temp.lines1
outfq1=$(echo "$fq_r1" | cut -d'.' -f 1)
outfq2=$(echo "$fq_r2" | cut -d'.' -f 1)
#4. Remove the line numbers recorded in “lines” from both fastqs
awk 'NR==FNR{l[$0];next;} !(FNR in l)' \
temp.lines "$fq_r1" \
> "$outfq1-$minLength.fastq"
awk 'NR==FNR{l[$0];next;} !(FNR in l)' \
temp.lines "$fq_r2" \
> "$outfq2-$minLength.fastq"
gzip $outfq1-$minLength.fastq
gzip $outfq2-$minLength.fastq
rm temp.lines
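The two-file awk idiom in step 4 is compact, so a toy run may help show what it does. The sketch below is not part of remove_PE.sh; the file names (data.txt, drop.lines, kept.txt) are hypothetical and are created in a temporary directory.

```shell
#!/usr/bin/env bash
# Toy demonstration of the awk idiom used in step 4.
# All file names are hypothetical and live in a temporary directory.
set -euo pipefail
tmpdir=$(mktemp -d)

# A six-line "data" file and a list of line numbers to drop.
printf 'a\nb\nc\nd\ne\nf\n' > "$tmpdir/data.txt"
printf '2\n5\n' > "$tmpdir/drop.lines"

# While awk reads the first file, NR equals FNR, so each line number is
# stored as a key of the array l and the line is skipped with "next".
# For the second file, a line is printed only if its line number (FNR)
# is not a key of l.
awk 'NR==FNR{l[$0];next;} !(FNR in l)' \
    "$tmpdir/drop.lines" "$tmpdir/data.txt" \
    > "$tmpdir/kept.txt"

# Lines 2 and 5 (b and e) are removed; a, c, d, and f remain.
cat "$tmpdir/kept.txt"
```

One caveat of this idiom: if the list of line numbers is empty, NR==FNR also holds for the second file, so nothing is printed. It may be worth checking that temp.lines is not empty before relying on the output.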
Once you have saved the file, you may need to make the file executable by using the Linux
command “chmod”:
chmod +x remove_PE.sh
Then, run the following commands:
./remove_PE.sh ERR1823587_pure_R1.fastq ERR1823587_pure_R2.fastq 50
./remove_PE.sh ERR1823601_pure_R1.fastq ERR1823601_pure_R2.fastq 50
./remove_PE.sh ERR1823608_pure_R1.fastq ERR1823608_pure_R2.fastq 50
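The three invocations above differ only in the accession number, so they could also be driven from a loop. This is a minimal sketch, assuming remove_PE.sh and the "_pure_R1.fastq" and "_pure_R2.fastq" files are in the current directory; the "processed" and "skipped" counters are illustrative additions and not part of the original workflow.

```shell
#!/usr/bin/env bash
# Run remove_PE.sh on each sample with the same minimum length.
minLength=50
processed=0   # illustrative counter: samples handed to remove_PE.sh
skipped=0     # illustrative counter: samples with missing inputs

for sample in ERR1823587 ERR1823601 ERR1823608; do
    r1="${sample}_pure_R1.fastq"
    r2="${sample}_pure_R2.fastq"
    # Only call the script when it is executable and both mates exist.
    if [ -x ./remove_PE.sh ] && [ -f "$r1" ] && [ -f "$r2" ]; then
        ./remove_PE.sh "$r1" "$r2" "$minLength"
        processed=$((processed + 1))
    else
        echo "Skipping $sample: script or input files not found" >&2
        skipped=$((skipped + 1))
    fi
done

echo "processed=$processed skipped=$skipped"
```

Because remove_PE.sh writes to the fixed names temp.lines1 and temp.lines, the samples are processed one after another here rather than in parallel.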
At this point, the host sequences have been removed from the metagenomic data. The cleaned
reads are stored in “ERR1823587_pure_R1-50.fastq.gz” and “ERR1823587_pure_R2-50.fastq.gz”
for the sample of the healthy person, “ERR1823601_pure_R1-50.fastq.gz” and
“ERR1823601_pure_R2-50.fastq.gz” for the moderate sickle cell patient, and
“ERR1823608_pure_R1-50.fastq.gz” and “ERR1823608_pure_R2-50.fastq.gz” for the severe sickle
cell patient. To save storage space, you can delete the remaining uncompressed FASTQ files
using “rm *.fastq” and also delete all files in “fastqdir”.
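Before deleting anything, it may be worth confirming that the two files of each pair still contain the same number of reads, since step 4 removes the same line numbers from both. The sketch below is an optional check and not part of the original workflow; the helper name count_reads and the toy files are hypothetical, and the check assumes the standard four-line FASTQ record layout.

```shell
#!/usr/bin/env bash
# count_reads (hypothetical helper): print the number of reads in a
# gzipped FASTQ file, assuming four lines per record.
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Toy paired files with two records each, created in a temporary directory.
tmpdir=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' > "$tmpdir/toy_R1.fastq"
printf '@r1/2\nTTAA\n+\nIIII\n@r2/2\nCCGG\n+\nIIII\n' > "$tmpdir/toy_R2.fastq"
gzip "$tmpdir/toy_R1.fastq" "$tmpdir/toy_R2.fastq"

n1=$(count_reads "$tmpdir/toy_R1.fastq.gz")
n2=$(count_reads "$tmpdir/toy_R2.fastq.gz")

if [ "$n1" -eq "$n2" ]; then
    echo "Pair is in sync: $n1 reads in each file"
else
    echo "Pair is out of sync: $n1 vs $n2 reads" >&2
fi
```

For the real data, the same function could be called on, for example, “ERR1823587_pure_R1-50.fastq.gz” and “ERR1823587_pure_R2-50.fastq.gz”.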
The metagenomic FASTQ files are stored in “fastq_pure” as shown in Figure 8.2. Above,
we deleted the original FASTQ files from the “fastqdir” directory and the intermediate
FASTQ files from the “fastq_pure” directory. You can also delete the SAM and BAM files
from the “sam” directory and the reference sequences and indexes from the “ref” directory
if you want to save storage space. However, you are advised to keep the reference genome
files in “ref”, because you may need to repeat the steps and indexing usually takes a long time.
8.2.4 Assembly-Free Taxonomic Profiling
We can use the FASTQ files to perform taxonomic profiling without metagenome assembly.
This approach uses the NGS short or long reads present in the metagenomic samples to
assign taxonomic groups by identifying unique genomic regions in the reads. Long reads